COVID-19 Data Report

Introduction

The Data

The data used for this report is courtesy of John Hopkins University COVID-19. The datasets can be found here.

The first two datasets used include a daily count of the cases and deaths, respectively, for every country. The last two datasets include a daily count of the cases and deaths, respectively, for each state of the United States (US). The data ranges between January 2020 and March 2023. The global data is subdivided by province or state, whereas the US data is subdivided by state and county. All data frames contain information about the latitude and longitude corresponding to a given location.

The data can be read in through the URLs below. Additionally, the packages that will be used for this report have been included below for reference. Note: Please install these packages, if required.

#load in packages
library(tidyverse)
library(ggplot2)
library(lubridate)
library(countrycode)
library(plotly)

#read in files
url_in <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/"
file_names <-
  c(
    "time_series_covid19_confirmed_global.csv",
    "time_series_covid19_deaths_global.csv",
    "time_series_covid19_confirmed_US.csv",
    "time_series_covid19_deaths_US.csv"
  )
#string concatenate file names
urls <- str_c(url_in, file_names)

#read in all files using urls
global_cases <- read_csv(urls[1])
## Rows: 289 Columns: 1147
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr    (2): Province/State, Country/Region
## dbl (1145): Lat, Long, 1/22/20, 1/23/20, 1/24/20, 1/25/20, 1/26/20, 1/27/20,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
global_deaths <- read_csv(urls[2])
## Rows: 289 Columns: 1147
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr    (2): Province/State, Country/Region
## dbl (1145): Lat, Long, 1/22/20, 1/23/20, 1/24/20, 1/25/20, 1/26/20, 1/27/20,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
US_cases <- read_csv(urls[3])
## Rows: 3342 Columns: 1154
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr    (6): iso2, iso3, Admin2, Province_State, Country_Region, Combined_Key
## dbl (1148): UID, code3, FIPS, Lat, Long_, 1/22/20, 1/23/20, 1/24/20, 1/25/20...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
US_deaths <- read_csv(urls[4])
## Rows: 3342 Columns: 1155
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr    (6): iso2, iso3, Admin2, Province_State, Country_Region, Combined_Key
## dbl (1149): UID, code3, FIPS, Lat, Long_, Population, 1/22/20, 1/23/20, 1/24...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#show summary statistics and quick overview of each frame. 
head(global_cases)
## # A tibble: 6 × 1,147
##   `Province/State` `Country/Region`   Lat  Long `1/22/20` `1/23/20` `1/24/20`
##   <chr>            <chr>            <dbl> <dbl>     <dbl>     <dbl>     <dbl>
## 1 <NA>             Afghanistan       33.9 67.7          0         0         0
## 2 <NA>             Albania           41.2 20.2          0         0         0
## 3 <NA>             Algeria           28.0  1.66         0         0         0
## 4 <NA>             Andorra           42.5  1.52         0         0         0
## 5 <NA>             Angola           -11.2 17.9          0         0         0
## 6 <NA>             Antarctica       -71.9 23.3          0         0         0
## # … with 1,140 more variables: `1/25/20` <dbl>, `1/26/20` <dbl>,
## #   `1/27/20` <dbl>, `1/28/20` <dbl>, `1/29/20` <dbl>, `1/30/20` <dbl>,
## #   `1/31/20` <dbl>, `2/1/20` <dbl>, `2/2/20` <dbl>, `2/3/20` <dbl>,
## #   `2/4/20` <dbl>, `2/5/20` <dbl>, `2/6/20` <dbl>, `2/7/20` <dbl>,
## #   `2/8/20` <dbl>, `2/9/20` <dbl>, `2/10/20` <dbl>, `2/11/20` <dbl>,
## #   `2/12/20` <dbl>, `2/13/20` <dbl>, `2/14/20` <dbl>, `2/15/20` <dbl>,
## #   `2/16/20` <dbl>, `2/17/20` <dbl>, `2/18/20` <dbl>, `2/19/20` <dbl>, …
head(global_deaths)
## # A tibble: 6 × 1,147
##   `Province/State` `Country/Region`   Lat  Long `1/22/20` `1/23/20` `1/24/20`
##   <chr>            <chr>            <dbl> <dbl>     <dbl>     <dbl>     <dbl>
## 1 <NA>             Afghanistan       33.9 67.7          0         0         0
## 2 <NA>             Albania           41.2 20.2          0         0         0
## 3 <NA>             Algeria           28.0  1.66         0         0         0
## 4 <NA>             Andorra           42.5  1.52         0         0         0
## 5 <NA>             Angola           -11.2 17.9          0         0         0
## 6 <NA>             Antarctica       -71.9 23.3          0         0         0
## # … with 1,140 more variables: `1/25/20` <dbl>, `1/26/20` <dbl>,
## #   `1/27/20` <dbl>, `1/28/20` <dbl>, `1/29/20` <dbl>, `1/30/20` <dbl>,
## #   `1/31/20` <dbl>, `2/1/20` <dbl>, `2/2/20` <dbl>, `2/3/20` <dbl>,
## #   `2/4/20` <dbl>, `2/5/20` <dbl>, `2/6/20` <dbl>, `2/7/20` <dbl>,
## #   `2/8/20` <dbl>, `2/9/20` <dbl>, `2/10/20` <dbl>, `2/11/20` <dbl>,
## #   `2/12/20` <dbl>, `2/13/20` <dbl>, `2/14/20` <dbl>, `2/15/20` <dbl>,
## #   `2/16/20` <dbl>, `2/17/20` <dbl>, `2/18/20` <dbl>, `2/19/20` <dbl>, …
head(US_cases)
## # A tibble: 6 × 1,154
##        UID iso2  iso3  code3  FIPS Admin2  Province_State Country_Region   Lat
##      <dbl> <chr> <chr> <dbl> <dbl> <chr>   <chr>          <chr>          <dbl>
## 1 84001001 US    USA     840  1001 Autauga Alabama        US              32.5
## 2 84001003 US    USA     840  1003 Baldwin Alabama        US              30.7
## 3 84001005 US    USA     840  1005 Barbour Alabama        US              31.9
## 4 84001007 US    USA     840  1007 Bibb    Alabama        US              33.0
## 5 84001009 US    USA     840  1009 Blount  Alabama        US              34.0
## 6 84001011 US    USA     840  1011 Bullock Alabama        US              32.1
## # … with 1,145 more variables: Long_ <dbl>, Combined_Key <chr>,
## #   `1/22/20` <dbl>, `1/23/20` <dbl>, `1/24/20` <dbl>, `1/25/20` <dbl>,
## #   `1/26/20` <dbl>, `1/27/20` <dbl>, `1/28/20` <dbl>, `1/29/20` <dbl>,
## #   `1/30/20` <dbl>, `1/31/20` <dbl>, `2/1/20` <dbl>, `2/2/20` <dbl>,
## #   `2/3/20` <dbl>, `2/4/20` <dbl>, `2/5/20` <dbl>, `2/6/20` <dbl>,
## #   `2/7/20` <dbl>, `2/8/20` <dbl>, `2/9/20` <dbl>, `2/10/20` <dbl>,
## #   `2/11/20` <dbl>, `2/12/20` <dbl>, `2/13/20` <dbl>, `2/14/20` <dbl>, …
head(US_deaths)
## # A tibble: 6 × 1,155
##        UID iso2  iso3  code3  FIPS Admin2  Province_State Country_Region   Lat
##      <dbl> <chr> <chr> <dbl> <dbl> <chr>   <chr>          <chr>          <dbl>
## 1 84001001 US    USA     840  1001 Autauga Alabama        US              32.5
## 2 84001003 US    USA     840  1003 Baldwin Alabama        US              30.7
## 3 84001005 US    USA     840  1005 Barbour Alabama        US              31.9
## 4 84001007 US    USA     840  1007 Bibb    Alabama        US              33.0
## 5 84001009 US    USA     840  1009 Blount  Alabama        US              34.0
## 6 84001011 US    USA     840  1011 Bullock Alabama        US              32.1
## # … with 1,146 more variables: Long_ <dbl>, Combined_Key <chr>,
## #   Population <dbl>, `1/22/20` <dbl>, `1/23/20` <dbl>, `1/24/20` <dbl>,
## #   `1/25/20` <dbl>, `1/26/20` <dbl>, `1/27/20` <dbl>, `1/28/20` <dbl>,
## #   `1/29/20` <dbl>, `1/30/20` <dbl>, `1/31/20` <dbl>, `2/1/20` <dbl>,
## #   `2/2/20` <dbl>, `2/3/20` <dbl>, `2/4/20` <dbl>, `2/5/20` <dbl>,
## #   `2/6/20` <dbl>, `2/7/20` <dbl>, `2/8/20` <dbl>, `2/9/20` <dbl>,
## #   `2/10/20` <dbl>, `2/11/20` <dbl>, `2/12/20` <dbl>, `2/13/20` <dbl>, …

A summary of each dataset has been provided through the included console messages. However, for the purposes of further analyses, data for cases and deaths will be combined (for global and US data individually).

Further data wrangling and transformation will be done for each section in this report. However, an initial tidying step has been included below to show the general structure that will be utilized for the analysis.

Tidying the Data

First, the global data will be tidied to provide an preliminary data frame. Population data, also provided by John Hopkins University, will be incorporated into the data frames.

#Tidy data - global
global_cases <- global_cases %>%
  pivot_longer(
    cols = -c(`Province/State`,
              `Country/Region`,
              Lat,
              Long),
    #dates in the columns, which serve as names
    #cases are the data in the dates columns, which are values 
    names_to = "date",
    values_to = "cases"
  ) %>%
  #clean up lat and long columns since they aren't needed
  select(-c(Lat, Long))


#tidy up global deaths
global_deaths <- global_deaths %>%
  pivot_longer(cols = -c(`Province/State`,
                         `Country/Region`,
                         Lat,Long),
               names_to = "date",
               values_to = "deaths") %>%
  select(-c(Lat, Long))

#join cases and death data
global <- global_cases %>%
  full_join(global_deaths) %>%
  rename(Country_Region = `Country/Region`,
         Province_State = `Province/State`) %>%
  #change date to date object
  mutate(date = mdy(date))

#missing population data for global data set. 
#need a combined key
global <- global %>%
  #create combined key using unite
  unite("Combined_Key",
    c(Province_State, Country_Region),
    #comma separator
    sep = ", ",
    na.rm = TRUE,
    remove = FALSE
  )

#need population data.
uid_lookup_url <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/UID_ISO_FIPS_LookUp_Table.csv"
uid <- read_csv(uid_lookup_url) %>%
  select(-c(Lat, Long_, Combined_Key, code3, iso2, iso3, Admin2))

#join by population
global <- global %>%
  left_join(uid, by = c("Province_State", "Country_Region")) %>%
  select(-c(UID, FIPS)) %>%
  select(Province_State, Country_Region, date,
         cases, deaths, Population,
         Combined_Key)

summary(global)
##  Province_State     Country_Region          date                cases          
##  Length:330327      Length:330327      Min.   :2020-01-22   Min.   :        0  
##  Class :character   Class :character   1st Qu.:2020-11-02   1st Qu.:      680  
##  Mode  :character   Mode  :character   Median :2021-08-15   Median :    14429  
##                                        Mean   :2021-08-15   Mean   :   959384  
##                                        3rd Qu.:2022-05-28   3rd Qu.:   228517  
##                                        Max.   :2023-03-09   Max.   :103802702  
##                                                                                
##      deaths          Population        Combined_Key      
##  Min.   :      0   Min.   :6.700e+01   Length:330327     
##  1st Qu.:      3   1st Qu.:5.790e+05   Class :character  
##  Median :    150   Median :6.574e+06   Mode  :character  
##  Mean   :  13380   Mean   :2.769e+07                     
##  3rd Qu.:   3032   3rd Qu.:2.642e+07                     
##  Max.   :1123836   Max.   :1.380e+09                     
##                    NA's   :10287

The US data can also be tidied in a similar way.

US_cases <- US_cases %>%
  #use combined key to retain cols
  pivot_longer(cols = -(UID:Combined_Key),
               names_to = "date",
               values_to = "cases") %>%
  #clean up some more
  #county name select
  select(Admin2:cases) %>%
  mutate(date = mdy(date)) %>%
  select(-c(Lat, Long_))

#do same for US deaths
US_deaths <- US_deaths %>%
  pivot_longer(cols = -(UID:Population),
               names_to = "date",
               values_to = "deaths") %>%
  #clean up some more
  #county name select
  select(Admin2:deaths) %>%
  mutate(date = mdy(date)) %>%
  select(-c(Lat, Long_))


#join data
US <- US_cases %>%
  full_join(US_deaths)

summary(US)
##     Admin2          Province_State     Country_Region     Combined_Key      
##  Length:3819906     Length:3819906     Length:3819906     Length:3819906    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##       date                cases           Population           deaths       
##  Min.   :2020-01-22   Min.   :  -3073   Min.   :       0   Min.   :  -82.0  
##  1st Qu.:2020-11-02   1st Qu.:    330   1st Qu.:    9917   1st Qu.:    4.0  
##  Median :2021-08-15   Median :   2272   Median :   24892   Median :   37.0  
##  Mean   :2021-08-15   Mean   :  14088   Mean   :   99604   Mean   :  186.9  
##  3rd Qu.:2022-05-28   3rd Qu.:   8159   3rd Qu.:   64979   3rd Qu.:  122.0  
##  Max.   :2023-03-09   Max.   :3710586   Max.   :10039107   Max.   :35545.0

Note: Please click on the tabs at the top of this report to switch between sections.

Global Cases and Deaths

The Question

Main Question: What did the pandemic look like at its “peak” for each country?

There are many ways to define the “peak” of the pandemic. For the purposes of this analysis, the point in time in which each country reached its max number of COVID-19 cases per million people will be used. To provide more context, a visualization will be created that measures this variable against the number of COVID-19 deaths per million people.

Transforming the Data- “Peak” of the Pandemic

To visualize the relationship between these variables, a specialized scatter plot, known as a bubble chart, will be used. This visualization allows for an extra dimension to be visualized determined by the size of the bubble. Additionally, the bubbles will be colour-coded to represent the continents. In summary, the following elements are of note:
- The x-axis: COVID-19 cases per million people for a given country.
- The y-axis: COVID-19 deaths per million people for a given country.
- Bubble size: The population of a given country. Larger bubbles represent larger populations. Population data has been sourced from John Hopkins University, as outlined in the previous section.
- Bubble colour: The continent that a given country belongs to. This will be categorized using the countrycode package. There are some conflicts that arise from the data, which are handled below.

#aggregate data based on totals
global_total <- global %>%
  group_by(Country_Region, date) %>%
  summarize(cases = sum(cases), deaths = sum(deaths),
            Population = sum(Population)) %>%
  select(Country_Region, date, cases, deaths,
         Population) %>%
  ungroup()

#add continent to global
global_total$Continent <- countrycode(
  sourcevar = global_total$Country_Region,
  origin = "country.name",
  destination = "continent"
)
## Warning: Some values were not matched unambiguously: Diamond Princess, Kosovo, Micronesia, MS Zaandam, Summer Olympics 2020, Winter Olympics 2022
summary(global_total)
##  Country_Region          date                cases               deaths       
##  Length:229743      Min.   :2020-01-22   Min.   :        0   Min.   :      0  
##  Class :character   1st Qu.:2020-11-02   1st Qu.:     3831   1st Qu.:     46  
##  Mode  :character   Median :2021-08-15   Median :    52933   Median :    786  
##                     Mean   :2021-08-15   Mean   :  1379412   Mean   :  19238  
##                     3rd Qu.:2022-05-28   3rd Qu.:   499592   3rd Qu.:   7227  
##                     Max.   :2023-03-09   Max.   :103802702   Max.   :1123836  
##                                                                               
##    Population         Continent        
##  Min.   :8.090e+02   Length:229743     
##  1st Qu.:1.886e+06   Class :character  
##  Median :8.696e+06   Mode  :character  
##  Mean   :3.246e+07                     
##  3rd Qu.:2.769e+07                     
##  Max.   :1.380e+09                     
##  NA's   :8001

There are a number of missing values that are present in the data, particularly in the Population column. The output from the continent classification indicates that further investigation into these values is also required.

#investigate missing values and continent conflicts
non_country <- unique(global_total$Country_Region[which(is.na(global_total$Population))])
non_country
## [1] "Antarctica"           "Canada"               "China"               
## [4] "Diamond Princess"     "MS Zaandam"           "Summer Olympics 2020"
## [7] "Winter Olympics 2022"
non_con <- unique(global_total$Country_Region[which(is.na(global_total$Continent))])
non_con
## [1] "Diamond Princess"     "Kosovo"               "Micronesia"          
## [4] "MS Zaandam"           "Summer Olympics 2020" "Winter Olympics 2022"

Investigating the missing values reveals that some non-countries (and countries) have been included in the data, which is why they do not have corresponding population values. These values include cruise ships and Olympic events. This is valuable data, but for the purposes of this analysis, they will not be considered as countries. There seems to be some sort of error in displaying the populations of Canada and China most likely due to the way their provinces have been recorded. These values will be added manually.

There are also some countries that have not been classified by continent. Since they have been included in the COVID-19 data, they will be considered countries for the analysis. Kosovo will be considered a European country, and Micronesia will be considered an Oceanic country.

Finally, once the missing values have been resolved, the number of COVID-19 cases and deaths per million people per country can be added without errors arising.

#remove countries that aren't actually countries.
non_country <- non_country[!non_country %in% c("China", "Canada")]

global_total <- global_total %>%
  filter(!Country_Region %in% non_country)

sum(is.na(global_total$Population))
## [1] 2286
#update population values for China and Canada
pop_china <- uid$Population[uid$Country_Region == "China" &
                              is.na(uid$Province_State)]
pop_canada <- uid$Population[uid$Country_Region == "Canada" &
                               is.na(uid$Province_State)]

global_total <- global_total %>%
  mutate(Population = case_when(
    Country_Region == "China" & is.na(Population) ~ pop_china,
    Country_Region == "Canada" & is.na(Population) ~ pop_canada,
    TRUE ~ Population
  ))

#ensure that the number of countries seems reasonable
length(unique(global_total$Country_Region))
## [1] 196
#check population values are reasonable
min(global_total$Population)
## [1] 809
max(global_total$Population)
## [1] 1411778724
#Add in Kosovo and Micronesia's continents
kos <- global_total$Country_Region == "Kosovo"
global_total$Continent[kos] <- "Europe"
micro <- global_total$Country_Region == "Micronesia"
global_total$Continent[micro] <- "Oceania"

unique(global_total$Continent)
## [1] "Asia"     "Europe"   "Africa"   "Americas" "Oceania"
sum(is.na(global_total))
## [1] 0
#add cases and deaths per thousand people
global_total <- global_total %>%
    mutate(deaths_per_mill = deaths *1000000/ Population) %>%
    mutate(cases_per_mill = cases *1000000/ Population)

Visualizing the Data - “Peak” of the Pandemic

Now that the data has been transformed and the missing values have been addressed, the visualization can be created. Since a bubble chart of this scale has a great deal of information, an interactive graph is useful for users to further investigate the data. An interactive plot will be created using the plotly package.

#peak of the pandemic values
global_max_cases <- global_total %>%
  group_by(Country_Region) %>%
  slice(which.max(cases))

#capture American cases as a benchmark. 
amer_cases <- global_max_cases %>%
  filter(Country_Region == "US")

#create visualization
global_max_cases %>%
  plot_ly(
    type = "scatter",
    mode = "markers",
    #round values
    x = ~round(cases_per_mill, digits = 2),
    y = ~round(deaths_per_mill, digits = 2),
    size = ~Population,
    color = ~Continent,
    #alter size of markers slightly for clarity
    marker = list(sizeref = 0.4, sizemode = "area"),
    text = ~Country_Region
  ) %>%
  layout(
    title = "Peak Global COVID-19 Cases and Deaths per Million",
    xaxis = list(title = "Cases per Million"),
    yaxis = list(title = "Deaths per Million"),
    #add annotation to highlight the US.
    annotations = list(
      x = amer_cases$cases_per_mill,
      y = amer_cases$deaths_per_mill,
      text = amer_cases$Country_Region,
      xref = "x",
      yref = "y",
      showarrow = FALSE
    )
  )

This provides a snapshot of what each country’s “peak” looks like. However, this raises the question of how trends are measured over time. The “peak” does not reflect the nuances in case or death trajectories. These might occur due to considerations such as adopting new policies or rolling out vaccines. The data point for the United States has been labelled to serve as a benchmark.

Transforming the Data - Cases and Deaths over Time

Visualizing trends over time can aid in understanding how COVID-19 has affected countries over time. To do this, an animated plotly visualization can be created.

An important caveat is that the population for a given country will change over time. This is further complicated by reporting measures during the pandemic. For the purpose of this analysis, the populations used will remain static.

First, the data will be transformed to capture the cases and deaths per million per month for each country. In this case, the “per month” value will be obtained by taking the number of cases and deaths for the last day of each month. Since the data is cumulative, this will, by default, represent the number of cases at the end of a given month. Since the data only extends to 2023-03-09, this day will represent the last day of March 2023.

#find the range of the dates
range(global_total$date)
## [1] "2020-01-22" "2023-03-09"
#create a vector consisting of the last day of every month
month_dates <- seq(as.Date("2020-02-01"), length = 39, by = "1 month")-1
#replace value for March - data only goes up to March 9
month_dates <- replace(month_dates,month_dates == "2023-03-31", as.Date("2023-03-09"))
#filter global data set by last day of every month. 
global_monthly_totals <- global_total %>%
  filter(date %in% month_dates)

#the "animation" frame works by transforming dates to factors.
global_monthly_totals$date <- as.factor(global_monthly_totals$date)

Visualizing the Data - Cases and Deaths over Time

Note: To see the reported cases and deaths per million each month by country, click the “play” button below. The slider can be used to navigate to a specific month. The date is displayed above the slider.

global_monthly_totals %>%
  plot_ly(
    type = "scatter",
    mode = "markers",
    x = ~round(cases_per_mill, digits = 2),
    y = ~round(deaths_per_mill, digits = 2),
    size = ~Population,
    color = ~Continent,
    marker = list(sizeref = 0.4, sizemode = "area"),
    #animate using the frame parameter
    frame = ~date,
    text = ~Country_Region
  ) %>%
  layout(
    title = "Global COVID-19 Cases and Deaths per Million Over Time",
    xaxis = list(title = "Cases per Million"),
    yaxis = list(title = "Deaths per Million")
  ) %>%
  #alter the speed slightly for visual clarity
  animation_opts(
    750,
    redraw = FALSE
  ) %>%
  #rename the prefix that appears before the date
  animation_slider(
    currentvalue = list(prefix = "Date: ")
  )

Analysis and Future Directions

amer_cases
## # A tibble: 1 × 8
## # Groups:   Country_Region [1]
##   Country_Region date          cases deaths Population Continent deaths_per_mill
##   <chr>          <date>        <dbl>  <dbl>      <dbl> <chr>               <dbl>
## 1 US             2023-03-09   1.04e8 1.12e6  329466283 Americas            3411.
## # … with 1 more variable: cases_per_mill <dbl>
global_max_cases[which.max(global_max_cases$cases_per_mill),]
## # A tibble: 1 × 8
## # Groups:   Country_Region [1]
##   Country_Region date       cases deaths Population Continent deaths_per_mill
##   <chr>          <date>     <dbl>  <dbl>      <dbl> <chr>               <dbl>
## 1 San Marino     2023-03-07 23616    122      33938 Europe              3595.
## # … with 1 more variable: cases_per_mill <dbl>
global_max_cases[which.max(global_max_cases$deaths_per_mill),]
## # A tibble: 1 × 8
## # Groups:   Country_Region [1]
##   Country_Region date         cases deaths Population Continent deaths_per_mill
##   <chr>          <date>       <dbl>  <dbl>      <dbl> <chr>               <dbl>
## 1 Peru           2023-03-08 4487553 219539   32971846 Americas            6658.
## # … with 1 more variable: cases_per_mill <dbl>
mean(global_max_cases$cases_per_mill)
## [1] 169917

The country with the most COVID-19 cases per million at its peak is San Marino. San Marino has a population of approximately 34,000 and had close to 24,000 cases at its peak.The country with the most deaths per million is Peru. On average, there were 169,917 cases per million people across all countries when considering the peak for each country. America, at its peak, had 315,063 cases per million people.

When considering trends by continental region, European countries have varying cases per million, but middling deaths per million. Countries in the Americas and Africa are generally on the lower end of cases and deaths per million. A strong trend does not persist for Asian countries. Oceania has varying cases per million, but low deaths per million.

The temporal changes in cases and deaths per million can also be analyzed by observing the changes in the number of cases per month. The ratio between cases and deaths per million remains fairly consistent on a country-by-country level until the end of 2021. There is a noticeable spike in the number of cases per million at this time, but not a large shift in the number of deaths per million. A number of factors could contribute toward this observed trend, such as vaccination rates, policy changes, and COVID-19 variants.

Future directions for analysis include some of the epidemiological considerations for the spread of viruses. The incorporation of variables regarding vaccines, policies regarding practices for COVID-19, awareness campaigns, and hospitalization rates serve as interesting future directions for better understanding the trends observed in the spread of COVID-19.

US Data - Regional Cases

The Question

Main Question: How did the United States fare in terms of cases and deaths at the state level?

To explore the effects of COVID-19 in the US, a choropleth map can be created. The ratio between cases and deaths, or fatalities, can serve as a measure of the performance of each individual state.

Transforming the Data - US Case to Fatality Ratio

To investigate further, the data must be tidied further to handle missing values and aggregate cases at the state level.

#subset values that are missing
US_missing <- US[which(is.na(US$Admin2)),]
US_missing_PS <- unique(US_missing$Province_State)
US_missing_PS
## [1] "American Samoa"           "Diamond Princess"        
## [3] "Grand Princess"           "Guam"                    
## [5] "Northern Mariana Islands" "Virgin Islands"
#filter data
US <- US %>%
  filter(!Province_State %in% US_missing_PS)
sum(is.na(US))
## [1] 0

Some of the missing values stem from the inclusion of US territories. While this is valuable data, the analysis is focused on the effect of the pandemic on the states of the US. As such, these values can be safely removed without having an impact on the analysis.

The last available day of the data will be used to determine the case to fatality ratio, as the cumulative data will represent the totality of these variables.

#cases/deaths by state
US_by_state <- US %>%
  #group data by state, region
  group_by(Province_State, Country_Region, date) %>%
  #for each state, take the cases to be the sum of cases in the state
  summarize(cases = sum(cases), deaths = sum(deaths),
            Population = sum(Population)) %>%
  #filter only last day in data frame
  filter(date == as.Date("2023-03-09")) %>%
  #add case:fatality ratio
  mutate(cases_to_deaths = cases/deaths) 

#check which states aren't supposed to be part of the 50
#state.name is a built-in vector. 
not_states <- US_by_state[!US_by_state$Province_State %in% state.name ,]
not_states
## # A tibble: 2 × 7
## # Groups:   Province_State, Country_Region [2]
##   Province_State       Country_Region date         cases deaths Population
##   <chr>                <chr>          <date>       <dbl>  <dbl>      <dbl>
## 1 District of Columbia US             2023-03-09  177945   1432     705749
## 2 Puerto Rico          US             2023-03-09 1101469   5823    3754939
## # … with 1 more variable: cases_to_deaths <dbl>
US_by_state <- US_by_state %>%
  filter(!Province_State %in% not_states$Province_State)


#add state abbreviations for graphing purposes.
US_by_state$State_Abbr <- state.abb

To complete the visualization, the data included must only be the 50 US states. Puerto Rico and the District of Columbia have been included in the COVID-19 data at the state level. For the purpose of this analysis, they will be removed to maintain the focus on states. Abbreviations for each state have been including using the built-in R state data. These are required for the plot_geo function.

Visualizing the Data - US Case to Fatality Ratio

US_by_state %>%
  #indicate that US states are being used
  plot_geo(
    locationmode = 'USA-states'
  ) %>%
  add_trace(
    z= ~cases_to_deaths,
    locations = ~State_Abbr,
    color= ~cases_to_deaths,
    hoverinfo = 'text',
    #alter the text that appears on hover
    text = ~paste(
      '</br> State: ', Province_State, ' (',State_Abbr ,')',
      '</br> Ratio: ', round(cases_to_deaths, digits = 2),
      '</br> Cases: ', round(cases, digits = 2),
      '</br> Deaths: ', round(deaths, digits = 2)
    )
  ) %>%
  layout(
  #zoom into graph so that only the US is seen.
   geo = list(scope = 'usa'),
   title = "COVID-19 Case to Fatality Ratio for US States"  
   ) %>%
  #change legend title - slightly different procedure for continuous scale legends.
  colorbar(title = "Case to Fatality Ratio")
US_by_state %>%
  #"ungroup" here for the following operations.
  ungroup() %>%
  arrange(desc(cases_to_deaths)) %>%
  slice(1, length(US_by_state$cases_to_deaths))
## # A tibble: 2 × 8
##   Province_State Country_Region date         cases deaths Population
##   <chr>          <chr>          <date>       <dbl>  <dbl>      <dbl>
## 1 Alaska         US             2023-03-09  307655   1486     740995
## 2 Pennsylvania   US             2023-03-09 3527854  50398   12801989
## # … with 2 more variables: cases_to_deaths <dbl>, State_Abbr <chr>
mean(US_by_state$cases_to_deaths)
## [1] 101.2346

Future Directions and Analysis

The state with the lowest case to death ratio is Pennsylvania. The highest ratio is Alaska. On average, the case to fatality ratio was 101 cases:deaths. While the ratio accurately represents the relationship between cases and deaths, it does not reflect the population sizes of each state.

This ratio does provide some insight into COVID-19 activity at the state level, but it may not capture the nuances of COVID-19 transmission. Future directions may include the integration of vaccination rates and population sizes. Additionally, further analysis of states grouped by their respective regions would provide higher-order trends.

Model - COVID-19 in the Americas

The Question

Main question: Is there a relationship between the number of cases per million and deaths per million for countries in the Americas?

To answer this question, a simple linear model will be created to determine if it is an appropriate way to describe the relationship between cases and deaths per million.

Transforming the Data - COVID-19 in the Americas

First, the “peak” COVID-19 data will be repurposed. It will be filtered to only contain values for countries in the Americas. Please note that the “peak” here represents the day that the cases reached a maximum point for individual countries.

#filter max cases data 
americas_max_cases <- global_max_cases %>%
  filter(Continent == "Americas")

#preliminary missing checks 
sum(is.na(americas_max_cases))
## [1] 0

Visualizing the Data - COVID-19 in the Americas

#linear model
mod <- lm(deaths_per_mill ~ cases_per_mill, data = americas_max_cases)
summary(mod)
## 
## Call:
## lm(formula = deaths_per_mill ~ cases_per_mill, data = americas_max_cases)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1398.2  -727.6  -193.9   462.6  4751.5 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    1.054e+03  3.473e+02   3.034  0.00468 **
## cases_per_mill 6.269e-03  2.141e-03   2.928  0.00614 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1135 on 33 degrees of freedom
## Multiple R-squared:  0.2062, Adjusted R-squared:  0.1822 
## F-statistic: 8.574 on 1 and 33 DF,  p-value: 0.006138
#use linear model to predict deaths based on linear model.
americas_max_cases_pred <- americas_max_cases %>%
  ungroup() %>%
  mutate(pred = predict(mod))

#to see how good our model is, we can plot predictions against actual values.
americas_max_cases_pred %>%
  plot_ly(
    type = "scatter",
    mode = "markers",
    x = ~cases_per_mill,
    y = ~deaths_per_mill,
    text = ~Country_Region,
    name = "Actual"
  ) %>%
  add_trace(
    x = ~cases_per_mill,
    y = ~pred,
    text = ~Country_Region,
    name = "Predicted"
  ) %>%
  layout(
    title = "Predicted and Actual COVID-19 Cases and Deaths per Million - The Americas",
    xaxis = list(title = "Cases per Million"),
    yaxis = list(title = "Deaths per Million")
  )

There is a clear outlier present in the data. Peru has an extremely high deaths per million value in comparison to its cases per million. As an exercise, Peru’s values will be removed to observe how its effect on the analysis.

americas_max_cases_outlier <- americas_max_cases %>%
  filter(!Country_Region == "Peru")

mod1 <- lm(deaths_per_mill ~ cases_per_mill, data = americas_max_cases_outlier)
summary(mod1)
## 
## Call:
## lm(formula = deaths_per_mill ~ cases_per_mill, data = americas_max_cases_outlier)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1257.13  -586.02   -73.88   593.87  1347.19 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    9.160e+02  2.386e+02   3.839 0.000549 ***
## cases_per_mill 6.254e-03  1.464e-03   4.271 0.000163 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 776.5 on 32 degrees of freedom
## Multiple R-squared:  0.363,  Adjusted R-squared:  0.3431 
## F-statistic: 18.24 on 1 and 32 DF,  p-value: 0.0001628
americas_max_cases_outlier_pred <- americas_max_cases_outlier %>%
  ungroup() %>%
  mutate(pred = predict(mod1))

americas_max_cases_outlier_pred %>%
  plot_ly(
    type = "scatter",
    mode = "markers",
    x = ~cases_per_mill,
    y = ~deaths_per_mill,
    text = ~Country_Region,
    name = "Actual"
  ) %>%
  add_trace(
    x = ~cases_per_mill,
    y = ~pred,
    text = ~Country_Region,
    name = "Predicted"
  ) %>%
  layout(
    title = "Predicted and Actual COVID-19 Cases and Deaths per Million - The Americas",
    xaxis = list(title = "Cases per Million"),
    yaxis = list(title = "Deaths per Million")
  )

Future Directions and Analysis - COVID-19 in the Americas

#min/max values
americas_max_cases_pred %>%
  ungroup() %>%
  #arrange in ascending order - least to most dpm
  arrange((deaths_per_mill)) %>%
  slice(1:5, (length(deaths_per_mill)-4):length(deaths_per_mill)) %>%
  select(Country_Region,cases_per_mill, deaths_per_mill, pred, everything())
## # A tibble: 10 × 9
##    Country_Region  cases_per_mill deaths_per_mill  pred date        cases deaths
##    <chr>                    <dbl>           <dbl> <dbl> <date>      <dbl>  <dbl>
##  1 Nicaragua                2363.            37.0 1068. 2023-03-08 1.57e4 2.45e2
##  2 Haiti                    3000.            75.4 1072. 2023-03-04 3.42e4 8.6 e2
##  3 Venezuela               19418.           206.  1175. 2023-03-09 5.52e5 5.85e3
##  4 Dominican Repu…         60914.           404.  1436. 2023-03-08 6.61e5 4.38e3
##  5 El Salvador             31110.           651.  1249. 2022-08-30 2.02e5 4.22e3
##  6 Trinidad and T…        135705.          3112.  1904. 2023-03-07 1.90e5 4.36e3
##  7 Brazil                 174451.          3290.  2147. 2023-03-03 3.71e7 6.99e5
##  8 Chile                  271617.          3362.  2756. 2023-03-09 5.19e6 6.43e4
##  9 US                     315063.          3411.  3029. 2023-03-09 1.04e8 1.12e6
## 10 Peru                   136103.          6658.  1907. 2023-03-08 4.49e6 2.20e5
## # … with 2 more variables: Population <dbl>, Continent <chr>
americas_max_cases_pred %>%
  ungroup() %>%
  #arrange in ascending order - least to most dpm
  arrange((pred)) %>%
  slice(1:5, (length(deaths_per_mill)-4):length(deaths_per_mill)) %>%
  select(Country_Region,cases_per_mill, deaths_per_mill, pred, everything())
## # A tibble: 10 × 9
##    Country_Region cases_per_mill deaths_per_mill  pred date         cases deaths
##    <chr>                   <dbl>           <dbl> <dbl> <date>       <dbl>  <dbl>
##  1 Nicaragua               2363.            37.0 1068. 2023-03-08  1.57e4 2.45e2
##  2 Haiti                   3000.            75.4 1072. 2023-03-04  3.42e4 8.6 e2
##  3 Venezuela              19418.           206.  1175. 2023-03-09  5.52e5 5.85e3
##  4 El Salvador            31110.           651.  1249. 2022-08-30  2.02e5 4.22e3
##  5 Honduras               47680.          1122.  1353. 2023-03-06  4.72e5 1.11e4
##  6 Panama                239116.          1995.  2553. 2023-03-06  1.03e6 8.61e3
##  7 Chile                 271617.          3362.  2756. 2023-03-09  5.19e6 6.43e4
##  8 Uruguay               297750.          2193.  2920. 2023-02-27  1.03e6 7.62e3
##  9 US                    315063.          3411.  3029. 2023-03-09  1.04e8 1.12e6
## 10 Barbados              371638.          2015.  3384. 2023-03-09  1.07e5 5.79e2
## # … with 2 more variables: Population <dbl>, Continent <chr>
americas_max_cases_outlier_pred %>%
  ungroup() %>%
  arrange((deaths_per_mill)) %>%
  slice(1:5, (length(deaths_per_mill)-4):length(deaths_per_mill)) %>%
  select(Country_Region,cases_per_mill, deaths_per_mill, pred, everything())
## # A tibble: 10 × 9
##    Country_Region  cases_per_mill deaths_per_mill  pred date        cases deaths
##    <chr>                    <dbl>           <dbl> <dbl> <date>      <dbl>  <dbl>
##  1 Nicaragua                2363.            37.0  931. 2023-03-08 1.57e4 2.45e2
##  2 Haiti                    3000.            75.4  935. 2023-03-04 3.42e4 8.6 e2
##  3 Venezuela               19418.           206.  1037. 2023-03-09 5.52e5 5.85e3
##  4 Dominican Repu…         60914.           404.  1297. 2023-03-08 6.61e5 4.38e3
##  5 El Salvador             31110.           651.  1111. 2022-08-30 2.02e5 4.22e3
##  6 Argentina              222254.          2887.  2306. 2023-03-06 1.00e7 1.30e5
##  7 Trinidad and T…        135705.          3112.  1765. 2023-03-07 1.90e5 4.36e3
##  8 Brazil                 174451.          3290.  2007. 2023-03-03 3.71e7 6.99e5
##  9 Chile                  271617.          3362.  2615. 2023-03-09 5.19e6 6.43e4
## 10 US                     315063.          3411.  2886. 2023-03-09 1.04e8 1.12e6
## # … with 2 more variables: Population <dbl>, Continent <chr>
americas_max_cases_outlier_pred %>%
  ungroup() %>%
  arrange((pred)) %>%
  slice(1:5, (length(deaths_per_mill)-4):length(deaths_per_mill)) %>%
  select(Country_Region,cases_per_mill, deaths_per_mill, pred, everything())
## # A tibble: 10 × 9
##    Country_Region cases_per_mill deaths_per_mill  pred date         cases deaths
##    <chr>                   <dbl>           <dbl> <dbl> <date>       <dbl>  <dbl>
##  1 Nicaragua               2363.            37.0  931. 2023-03-08  1.57e4 2.45e2
##  2 Haiti                   3000.            75.4  935. 2023-03-04  3.42e4 8.6 e2
##  3 Venezuela              19418.           206.  1037. 2023-03-09  5.52e5 5.85e3
##  4 El Salvador            31110.           651.  1111. 2022-08-30  2.02e5 4.22e3
##  5 Honduras               47680.          1122.  1214. 2023-03-06  4.72e5 1.11e4
##  6 Panama                239116.          1995.  2411. 2023-03-06  1.03e6 8.61e3
##  7 Chile                 271617.          3362.  2615. 2023-03-09  5.19e6 6.43e4
##  8 Uruguay               297750.          2193.  2778. 2023-02-27  1.03e6 7.62e3
##  9 US                    315063.          3411.  2886. 2023-03-09  1.04e8 1.12e6
## 10 Barbados              371638.          2015.  3240. 2023-03-09  1.07e5 5.79e2
## # … with 2 more variables: Population <dbl>, Continent <chr>

Base model: In the base model, the model has an intercept value of 1054. The slope value is 0.006269; for every additional case per million, it is expect that there is an additional 0.006269 deaths per million. The model has an adjusted R-squared value of 0.1822 and a residual standard error of 1135 with 33 degrees of freedom. In this version, Nicaragua has the lowest actual and predicted deaths per million. Peru has the highest actual deaths per million, whereas Barbados has the highest predicted deaths per million.

Alternate model - outlier removed : In this version of the model, the model has an intercept value of 916. The slope value is 0.006254 for every additional case per million, it is expected that there is an additional 0.006254 deaths per million. The model has an adjusted R-squared value of 0.3431 and a residual standard error of 776.5 with 32 degrees of freedom. In this version, Nicaragua has the lowest actual and predicted deaths per million. The United States has the highest actual deaths per million, and Barbados has the highest predicted deaths per million.

When outliers are removed, there is a moderate linear correlation between the cases and deaths per million in the Americas. This analysis is a preliminary one to demonstrate the correlation between just two variables. Future directions can expand and potentially improve on this model. Some future directions that this model may take include vaccination rates, temporal effects, and input related to variants. Different versions of this model may also expand the data to include other countries, or subdivide the data into “North” and “South” America.

Conclusion

Sources of Bias

The visualizations and analysis presented in this report provide insight into the trends and patterns regarding the COVID-19 pandemic. However, it is important to examine potential sources of bias when drawing conclusion about the results.

One potential source of bias lies in the reporting measures. Global data contains confirmed cases and deaths as a result of COVID-19. If the way in which data is captured and recorded relating this data is not consistent across individual countries, this may be reflected as inaccuracies or inconsistencies in the analysis.

On a systemic level, interpretation of the data needs to be handled with care. It is important to be mindful of how access to hospital equipment, vaccines, and personal protective equipment may be reflected in the number of cases and deaths. Furthermore, the cultural conception of public health may result in differing approaches to combating COVID-19, which may also be reflected in observed trends. Another systemic consideration that may have an affect on the analysis is the racial and ethnic disparities in COVID-19 infection, hospitalization, and transmission that may persist due to a number of socioeconomic factors.

It is possible that personal biases may have influenced the data analysis. Topics of contention that have the potential to influence the analysis include a Western conception of medicine, health, and treatment; full support of vaccinations being mandatory; a scientific background in biology and public health; and the benefit of hindsight when drawing conclusions about the data after the pandemic has ended. To mitigate these personal biases, the analysis presented is focused on objective results and information. The conclusions drawn pertain only to what can be ascertained statistically. While certain forms of biases cannot be fully prevented, they have been recorded to promote transparency in data analysis.

Conclusion

The report presented has produced results related to the temporal and geospatial aspects of the COVID-19 pandemic. There are a number of future directions that can be taken to expand or enhance the results presented. Furthering the understanding of a global pandemic will help promote an understanding in preventing and controlling future global pandemics.